This notebook demonstrates using the AutoML API for Language Sentiment.
The audience for this notebook is software engineers (SWEs) with limited experience in machine learning (ML).
You should be familiar with:
- Python 3.X
- Google Cloud Platform (GCP) and using Cloud Storage (GCS) buckets.
- The concept of natural language processing for sentiment analysis.
This notebook uses the Kaggle dataset for PetFinder Adoption Prediction, located at:
https://www.kaggle.com/c/petfinder-adoption-prediction/data
From the Kaggle web page, download train_sentiment.zip. Once it has downloaded, unzip it. The resulting subfolder, named train_sentiment, contains several JSON files, each of which holds a subset of instances (examples).
Each instance will contain:
text.content - text description by the owner
sentiment.score - how favorable the description is (between -1.0 and 1.0, in increments of 0.1)
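For illustration, one entry in these JSON files has roughly the shape shown below. The field names match what the conversion code later in this notebook reads; the example text is invented, and the real files contain additional fields.
In [ ]:
# Illustrative only: the rough shape of a single sentence entry in a train_sentiment JSON file.
example_instance = {
    "sentences": [
        {
            "text": {"content": "Milo is a playful and affectionate puppy."},
            "sentiment": {"score": 0.8}
        }
    ]
}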
The objective of this tutorial is to learn how to use the AutoML API to train a model for sentiment analysis, deploy the model, and make predictions through the gRPC or REST API interface.
This tutorial uses billable components of AutoML Language.
Learn about AutoML Language Pricing
If you are using Colab or AI Platform Notebooks, your environment already meets all the requirements to run this notebook. You can skip this step.
Otherwise, make sure your environment meets this notebook's requirements: Python 3, virtualenv, Jupyter, the google-cloud-automl package, and the Google Cloud SDK (for the gcloud and gsutil commands used below).
The Google Cloud guide to Setting up a Python development environment and the Jupyter installation guide provide detailed instructions for meeting these requirements. The following steps provide a condensed set of instructions:
1. Install the AutoML SDK by running pip install google-cloud-automl in a shell.
2. Install virtualenv and create a virtual environment that uses Python 3.
3. Activate that environment and run pip install jupyter in a shell to install Jupyter.
4. Run jupyter notebook in a shell to launch Jupyter.
5. Open this notebook in the Jupyter Notebook Dashboard.
The following steps are required, regardless of your notebook environment.
Enter your project ID in the cell below. Then run the cell to make sure the Cloud SDK uses the right project for all the commands in this notebook.
Note: Jupyter runs lines prefixed with ! as shell commands, and it interpolates Python variables prefixed with $ into these commands. Jupyter runs lines prefixed with % as automagic commands, which are interpreted within your IPython session. Automagic commands include %ls, %pwd, %env, and %pip, for example.
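For example, the following throwaway cell shows both behaviors: the shell command interpolates a Python variable, and the automagic command reports the current working directory. The variable name is arbitrary and used only for illustration.
In [ ]:
# Illustrative only: mixing a Python variable, a shell command, and an automagic command.
GREETING = "hello from a shell command"
! echo $GREETING
%pwd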
In [ ]:
PROJECT_ID = "[your-project-id]" #@param {type:"string"}
!gcloud config set project $PROJECT_ID
If you are using AI Platform Notebooks, your environment is already authenticated. Skip this step.
If you are using Colab, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.
Otherwise, follow these steps:
In the GCP Console, go to the Create service account key page.
From the Service account drop-down list, select New service account.
In the Service account name field, enter a name.
From the Role drop-down list, select Machine Learning Engine > AI Platform Admin and Storage > Storage Object Admin.
Click Create. A JSON file that contains your key downloads to your local environment.
Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.
In [ ]:
import sys
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.
if 'google.colab' in sys.modules:
    from google.colab import auth as google_auth
    google_auth.authenticate_user()
# If you are running this notebook locally, replace the string below with the
# path to your service account key and run this cell to authenticate your GCP
# account.
else:
    %env GOOGLE_APPLICATION_CREDENTIALS your_path_to_credentials.json
Finally, grant your user account and your service account the IAM roles needed to work with AutoML. Replace the placeholders in the next cell with your user ID and service account name before running it.
In [ ]:
!gcloud auth login
!gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="user:[your-userid@your-domain]" \
    --role="roles/automl.admin"
!gcloud projects add-iam-policy-binding $PROJECT_ID \
    --member="serviceAccount:[service-account-name]" \
    --role="roles/automl.editor"
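Optionally, you can confirm that the bindings took effect by listing the project's IAM policy. The formatting flags below are just one convenient way to display it.
In [ ]:
# Optional: display the project's IAM role bindings to verify the grants above.
! gcloud projects get-iam-policy $PROJECT_ID \
    --flatten="bindings[].members" \
    --format="table(bindings.role, bindings.members)"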
The following steps are required, regardless of your notebook environment.
When you submit a training job using the AutoML Language SDK, you must store the training data in a Cloud Storage (GCS) bucket.
Set the name of your Cloud Storage bucket below. It must be unique across all Cloud Storage buckets.
You may also change the COMPUTE_REGION variable, which is used for operations throughout the rest of this notebook. Make sure to choose a region where Cloud AI Platform services are available. You may not use a Multi-Regional Storage bucket for training with AI Platform.
In [ ]:
BUCKET_NAME = PROJECT_ID + "-lcm" #@param {type:"string"}
Only if your bucket doesn't already exist: Run the following cell to create your Cloud Storage bucket.
In [ ]:
# Default compute region for AutoML
COMPUTE_REGION='us-central1'
! gsutil mb -l $COMPUTE_REGION gs://$BUCKET_NAME
Finally, validate access to your Cloud Storage bucket by examining its contents:
In [ ]:
! gsutil ls -al gs://$BUCKET_NAME
In [ ]:
%pip install -U google-cloud-storage
In [ ]:
import tensorflow as tf
import numpy as np
# import the Google AutoML client library
from google.cloud import automl_v1beta1 as automl
We need to convert the JSON data into a CSV file that AutoML can import:
1. cd into the folder with the training data (train_sentiment).
2. Read the contents of each *.json file (using json.load()).
A. Find each sentence and sentiment pair.
B. Map the sentiment scores (-1.0 to 1.0) to the integer range 0 to 10, as checked in the cell below.
C. Write each sentence/sentiment pair as a row in the CSV file.
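Before running the conversion, here is a quick sanity check of the score-to-integer mapping; the formula is the same one used in the conversion code below.
In [ ]:
# Sanity check: map a few example sentiment scores from [-1.0, 1.0] to the 0..10 range.
for score in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print("{:+.1f} -> {}".format(score, int(score * 10 + 10) // 2))
# Expected: -1.0 -> 0, -0.5 -> 2, +0.0 -> 5, +0.5 -> 7, +1.0 -> 10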
In [ ]:
# Set the path location of the training data
TRAIN_PATH='[my-path-to-train-data]'
import os, json
# Create an empty CSV file
csv_file = open('petfinder.csv', 'w')
# Scan for each *.json file
os.chdir(TRAIN_PATH)
for file in os.scandir():
    if file.is_file():
        print(file.path)
        with open(file.path, 'r') as f:
            # Read in as a JSON object (Python dictionary)
            try:
                obj = json.load(f)
            except:
                continue
            # Get the sentence/sentiment pairs
            sentences = obj['sentences']
            for sentence in sentences:
                text = '"' + str(sentence['text']['content']) + '"'
                sentiment = str(int(sentence['sentiment']['score'] * 10 + 10) // 2)
                # Write the sentence/sentiment pair as a row in the CSV file.
                csv_file.write(text + ',' + sentiment + '\n')
csv_file.close()
Copy the CSV file to your GCS bucket.
In [ ]:
CSV_DATASET = "gs://" + BUCKET_NAME + "/csv/petfinder.csv"
!gsutil cp petfinder.csv $CSV_DATASET
In [ ]:
# Create an AutoML client
client = automl.AutoMlClient()
# Derive the full GCP path to the project
project_location = client.location_path(PROJECT_ID, COMPUTE_REGION)
A dataset contains representative samples of the type of content you want to analyze for sentiment, labeled with the sentiment ratings you want your custom model to use. The dataset serves as the input for training a model.
The main steps for building a dataset are:
- Specify a name for the dataset.
- Import data items into the dataset.
The first step in creating a custom model is to create an empty dataset that will eventually hold the training data for the model.
In [ ]:
# Specify a name for the dataset
DATASET_NAME="[my-dataset-name]"
DATASET_NAME="pet_finder"
# Specify the maximum integer value for the sentiment rating (0 through 10, mapped from -1.0 .. 1.0)
dataset_metadata = {"sentiment_max": 10}
# Set dataset name and metadata of the dataset.
my_dataset = {
    "display_name": DATASET_NAME,
    "text_sentiment_dataset_metadata": dataset_metadata,
}
# Create a dataset with the dataset metadata in the region.
response = client.create_dataset(project_location, my_dataset)
Display the response from creating the empty dataset.
In [ ]:
# Display the dataset information.
print("Dataset name: {}".format(response.name))
print("Dataset id: {}".format(response.name.split("/")[-1]))
print("Dataset display name: {}".format(response.display_name))
print("Image classification dataset metadata:")
print("\t{}".format(response.image_classification_dataset_metadata))
print("Dataset example count: {}".format(response.example_count))
# Save the dataset ID
dataset_id = response.name.split("/")[-1]
In [ ]:
# Get the full path of the dataset.
dataset_full_id = client.dataset_path(
PROJECT_ID, COMPUTE_REGION, dataset_id
)
# Specify the location of the CSV file for the dataset
input_config = {"gcs_source": {"input_uris": [CSV_DATASET]}}
# Import data from the input URI.
response = client.import_data(dataset_full_id, input_config)
Display the response from initiating the import of the training data into the dataset. The call will return when the import has completed. This may take up to 20 minutes.
In [ ]:
# synchronous check of operation status.
print("Data imported. {}".format(response.result()))
In [ ]:
response = client.list_datasets(project_location, None)
print("List of datasets:")
for dataset in response:
    # Display the dataset information.
    print("Dataset name: {}".format(dataset.name))
    print("Dataset id: {}".format(dataset.name.split("/")[-1]))
    print("Dataset display name: {}".format(dataset.display_name))
    print("Text Sentiment dataset metadata:")
    print("\t{}".format(dataset.text_sentiment_dataset_metadata))
    print("Dataset example count: {}\n".format(dataset.example_count))
You create a custom model by training it using a prepared dataset. AutoML API uses the items from the dataset to train the model, test it, and evaluate its performance. You review the results, adjust the training dataset as needed, and train a new model using the improved dataset.
Training a model can take several hours to complete. The AutoML API enables you to check the status of training.
In [ ]:
# Specify a name for your model.
MODEL_NAME="[your-model-name]"
# Set model name and model metadata for the text sentiment dataset.
my_model = {
    "display_name": MODEL_NAME,
    "dataset_id": dataset_id,
    "text_sentiment_model_metadata": {}
}
# Create a model with the model metadata in the region.
response = client.create_model(project_location, my_model)
Display response from initiating the training of the model.
In [ ]:
print("Training operation name: {}".format(response.operation.name))
Wait for training to complete. The call will return when training has completed. This may take up to 1 hour.
In [ ]:
# synchronous check of operation status.
print("Training done. {}".format(response.result()))
# Save the model ID
model_id = response.result().name.split("/")[-1]
In [ ]:
from google.cloud.automl_v1beta1 import enums
# Get the full path of the model.
model_full_id = client.model_path(PROJECT_ID, COMPUTE_REGION, model_id)
# Get complete detail of the model.
model = client.get_model(model_full_id)
# Retrieve deployment state.
if model.deployment_state == enums.Model.DeploymentState.DEPLOYED:
    deployment_state = "deployed"
else:
    deployment_state = "undeployed"
# Display the model information.
print("Model name: {}".format(model.name))
print("Model id: {}".format(model.name.split("/")[-1]))
print("Model display name: {}".format(model.display_name))
print("Text Sentiment model metadata:")
print("Training cost: {}".format(model.text_sentiment_model_metadata.train_cost))
print("Stop reason: {}".format(model.text_sentiment_model_metadata.stop_reason))
print("Base model id: {}".format(model.text_sentiment_model_metadata.base_model_id))
print("Model deployment state: {}".format(deployment_state))
After training a model, AutoML Language uses items from the TEST set to evaluate the quality and accuracy of the new model. For more information on how to interpret the evaluation, see Evaluating models.
In [ ]:
# Get the full path of the model.
model_full_id = client.model_path(PROJECT_ID, COMPUTE_REGION, model_id)
# List all the model evaluations in the model by applying filter.
response = client.list_model_evaluations(model_full_id, None)
print("List of model evaluations:")
for element in response:
    print(element)
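As an optional sketch, you can also pull a few summary metrics out of each evaluation. The field names below follow the v1beta1 TextSentimentEvaluationMetrics message; if your client version differs, compare against the full evaluations printed above.
In [ ]:
# Optional sketch: print selected metrics from each model evaluation.
# Field names assume the v1beta1 TextSentimentEvaluationMetrics message.
for evaluation in client.list_model_evaluations(model_full_id, None):
    metrics = evaluation.text_sentiment_evaluation_metrics
    print("Evaluation id: {}".format(evaluation.name.split("/")[-1]))
    print("\tPrecision: {}".format(metrics.precision))
    print("\tRecall: {}".format(metrics.recall))
    print("\tMean absolute error: {}".format(metrics.mean_absolute_error))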
In [ ]: